Family relationships: should consensus reign? - consensus clustering for protein families

Authors

  • Macha Nikolski
  • David James Sherman
Abstract

MOTIVATION: Reliable identification of protein families is key to phylogenetic analysis, functional annotation and the exploration of protein function diversity in a given phylogenetic branch. As more and more complete genomes are sequenced, there is a need for powerful and reliable algorithms that facilitate the construction of protein families.

RESULTS: We have formulated the problem of protein family construction as an instance of consensus clustering, for which we designed a novel algorithm that is computationally efficient in practice and produces high-quality results. Our algorithm uses an election method to construct consensus families from competing clustering computations. It is tailored to the specific needs of comparative genomics projects. First, it provides a robust means of incorporating results from different and complementary clustering methods, thus avoiding an a priori choice of method that may introduce computational bias into the results. Second, it is suited to large-scale projects owing to its practical efficiency. Third, it produces high-quality results in which families tend to represent groupings by biological function.

AVAILABILITY: This method has been used by the Génolevures project to compute protein families of Hemiascomycetous yeasts. The data are available online at http://cbi.labri.fr/Genolevures/fam/
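The abstract does not spell out the election method, so the following is only a minimal sketch of the general idea of consensus clustering by voting, not the authors' algorithm. It assumes a simple majority vote over co-clustered pairs followed by a transitive merge; the function name `consensus_families`, the `threshold` parameter and the protein identifiers are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def consensus_families(clusterings, threshold=0.5):
    """Merge items whose pairs co-cluster in more than `threshold` of the inputs.

    Illustrative majority-vote scheme (an assumption), not the published method.
    """
    n = len(clusterings)
    votes = defaultdict(int)
    items = set()
    for clustering in clusterings:
        for cluster in clustering:
            members = sorted(set(cluster))
            items.update(members)
            # Each clustering casts one vote for every pair it groups together.
            for a, b in combinations(members, 2):
                votes[(a, b)] += 1

    # Union-find: join the pairs that win the vote.
    parent = {x: x for x in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), count in votes.items():
        if count / n > threshold:
            parent[find(a)] = find(b)

    families = defaultdict(set)
    for x in items:
        families[find(x)].add(x)
    return sorted((sorted(f) for f in families.values()), key=lambda f: f[0])


# Three competing clusterings of five hypothetical proteins.
example = [
    [["p1", "p2", "p3"], ["p4", "p5"]],
    [["p1", "p2"], ["p3", "p4", "p5"]],
    [["p1", "p2", "p3"], ["p4"], ["p5"]],
]
print(consensus_families(example))
```

Because winning pairs are merged transitively, two proteins can end up in the same family without having won a majority themselves; a stricter scheme would need an explicit rule for resolving such conflicts.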


Related resources

Entropy-based Consensus for Distributed Data Clustering

The increasingly larger scale of available data and the more restrictive concerns on their privacy are some of the challenging aspects of data mining today. In this paper, Entropy-based Consensus on Cluster Centers (EC3) is introduced for clustering in distributed systems with a consideration for confidentiality of data; i.e. it is the negotiations among local cluster centers that are used in t...


A Survey of Social Factors Influencing Social Consensus (Case Study: Bushehr Civic Families)

The aim of this research is to study the social factors that influence social consensus. The sampling method was multi-process and included cluster and multistage sampling; the sample size, based on Cochran's formula, was 380 persons. The data collection tool was a questionnaire. In this research, the methods of data analysis were independent T-Test, Spearman Correlation Coefficient, Multivariate Regressio...


Towards an understanding of protein-protein interaction network hierarchies. Analysis of DnaN-binding peptide motifs in members of protein families interacting with the eubacterial processivity clamp, the β-subunit of DNA Polymerase III

The consensus pentapeptide QL[SD]LF is a major component in the interaction of a number of families of proteins with the eubacterial DNA-clamp protein, DnaN (the β-subunit of DNA Polymerase III holoenzyme). Rankings of the motifs were established using the program MEME. The distribution of motif rankings in the PolC, DinB2 and UmuC protein families was shown to be significantly skewed to hi...


PairsDB atlas of protein sequence space

Sequence similarity/database searching is a cornerstone of molecular biology. PairsDB is a database intended to make exploring protein sequences and their similarity relationships quick and easy. Behind PairsDB is a comprehensive collection of protein sequences and BLAST and PSI-BLAST alignments between them. Instead of running BLAST or PSI-BLAST individually on each request, results are retrie...


SYSTERS, GeneNest, SpliceNest: exploring sequence space from genome to protein

We have integrated the protein families from SYSTERS and the expressed sequence tag (EST) clusters from our database GeneNest with SpliceNest, a new database mapping EST contigs into genomic DNA. The SYSTERS protein sequence cluster set provides an automatically generated classification of all sequences of the SWISS-PROT, TrEMBL and PIR databases into disjoint protein family and superfamily clu...



Journal:
  • Bioinformatics

Volume 23, Issue 2

Pages: -

Publication date: 2007